This short R Markdown file accompanies Section 3 of the course notes. We will be using the function trendscatter in the package s20x, so you will need to install that package before going through these examples. One of the examples also uses the pairs.plus function, which is contained in the Regression.RData directory I e-mailed you. To have access to pairs.plus, you need to read that file into R by selecting Open File… from the File drop-down menu.
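From the R console, an .RData file can also be read with base R's load(), which restores every object saved in the file. The sketch below is self-contained: because the location of Regression.RData on your machine is not known here, it saves a stand-in function to a temporary file and loads it back. The toy function and the temporary path are placeholders, not the real pairs.plus.

```r
# Stand-in for the real pairs.plus (assumption: the real one is in Regression.RData).
pairs.plus <- function(x) pairs(x)

# Save it to a temporary .RData file (placeholder for the e-mailed file).
rdata_path <- file.path(tempdir(), "Regression_demo.RData")
save(pairs.plus, file = rdata_path)
rm(pairs.plus)

# load() restores the saved objects and invisibly returns their names.
loaded <- load(rdata_path)
loaded
exists("pairs.plus")
```

For the course file, replace rdata_path with the path where you saved Regression.RData. The s20x package only needs to be installed once, for example with install.packages("s20x").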
We begin by reading in the paddlefish data used in the first part of the Section 3 notes.
Paddle = read.csv("http://course1.winona.edu/bdeppa/Regression/Data/Paddlefish%20(clean).csv",header=TRUE)
names(Paddle)
## [1] "Age" "Length" "Weight"
head(Paddle)
## Age Length Weight
## 1 6 87.63 2.58
## 2 5 87.63 3.32
## 3 5 89.54 3.35
## 4 3 67.31 1.69
## 5 4 77.47 2.55
## 6 6 100.33 4.58
str(Paddle)
## 'data.frame': 183 obs. of 3 variables:
## $ Age : int 6 5 5 3 4 6 7 7 8 7 ...
## $ Length: num 87.6 87.6 89.5 67.3 77.5 ...
## $ Weight: num 2.58 3.32 3.35 1.69 2.55 4.58 5.34 4.49 4.53 5.41 ...
summary(Paddle)
## Age Length Weight
## Min. : 1.000 Min. : 43.82 Min. : 0.250
## 1st Qu.: 4.000 1st Qu.: 83.19 1st Qu.: 2.570
## Median : 5.000 Median : 90.81 Median : 3.470
## Mean : 5.541 Mean : 90.86 Mean : 4.248
## 3rd Qu.: 7.000 3rd Qu.:100.65 3rd Qu.: 5.115
## Max. :18.000 Max. :140.97 Max. :20.300
We will now use the trendscatter command in the s20x library to construct a scatterplot with smoothers added to visualize \(\small{E(Weight|Length)}\) and \(\small{SD(Weight|Length)}\).
require(s20x)
## Loading required package: s20x
trendscatter(Weight~Length,data=Paddle)
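To make the idea of smoothing both the mean and the spread concrete, here is a minimal sketch using only base R's lowess(). It is one common approach (smooth the response, then smooth the squared residuals) and is not necessarily what trendscatter does internally. The data are simulated, not the paddlefish data, and the names x, y, mean_fit and sd_fit are illustrative.

```r
set.seed(42)
n <- 200
x <- runif(n, 40, 140)                        # simulated predictor
y <- 0.0005 * x^2 + rnorm(n, sd = 0.01 * x)   # curved mean, spread grows with x

# Smooth estimate of E(y | x); lowess() returns x sorted, with fitted values in $y.
mean_fit <- lowess(x, y, f = 0.5)

# Residuals in the same (sorted) order as the fitted values.
ord <- order(x)
res <- y[ord] - mean_fit$y

# Smooth the squared residuals to estimate Var(y | x), then take the square root.
var_fit <- lowess(mean_fit$x, res^2, f = 0.5, iter = 0)
sd_fit <- sqrt(pmax(var_fit$y, 0))

plot(x, y, main = "Smoothed mean with +/- 1 SD bands (simulated data)")
lines(mean_fit, lwd = 2)
lines(mean_fit$x, mean_fit$y + sd_fit, lty = 2)
lines(mean_fit$x, mean_fit$y - sd_fit, lty = 2)
```

If the dashed bands fan out as x increases, the conditional standard deviation is not constant.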
We can clearly see that \(\small{E(Weight|Length)}\) is nonlinear, i.e. exhibits curvature, and that \(\small{Var(Weight|Length)}\), and hence \(\small{SD(Weight|Length)}\), is NOT constant. We can explore the effect of window width on the smoothing process by varying the fraction of the observations used in each window. The default setting in the trendscatter function is f = 0.50, i.e. use 50% of the data in each window. Keep in mind that the weighting function within each window downweights points the farther they are from the target point.
trendscatter(Weight~Length,data=Paddle,f=0.10) # Too noisy!
trendscatter(Weight~Length,data=Paddle,f=0.25)
trendscatter(Weight~Length,data=Paddle,f=0.75)
trendscatter(Weight~Length,data=Paddle,f=1) # Too smooth!
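The downweighting within a window can be sketched as follows. This assumes the tricube weight function documented for base R's loess(); the notes do not say which weight function trendscatter uses, so treat this as an illustration only. The helper tricube_weights and the simulated x values are hypothetical.

```r
# Tricube weights for one target point x0, using a fraction f of the data.
tricube_weights <- function(x, x0, f = 0.5) {
  n <- length(x)
  k <- max(2, ceiling(f * n))   # number of points defining the window
  d <- abs(x - x0)              # distance of each point from the target
  h <- sort(d)[k]               # window half-width: k-th smallest distance
  ifelse(d < h, (1 - (d / h)^3)^3, 0)
}

set.seed(1)
x <- sort(runif(20, 40, 140))   # simulated lengths, not the paddlefish data

w_half  <- tricube_weights(x, x0 = 90, f = 0.50)
w_tenth <- tricube_weights(x, x0 = 90, f = 0.10)

# Number of points that receive positive weight under each setting.
sum(w_half > 0)
sum(w_tenth > 0)
round(w_half, 2)
```

A small f gives positive weight to only a few neighbors, which is why the smooth becomes noisy, while f near 1 lets almost every point contribute to every local fit.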
These data contain measurements and ages (rings) of 4,175 abalones. We will use these data later in the course. The function pairs.plus in the Regression.RData directory creates a scatterplot matrix with histograms and smoothers added to visualize the distribution of each variable and the relationship between each pair of variables in these data.
Abalone = read.csv("http://course1.winona.edu/bdeppa/Regression/Data/abalone.csv")
pairs.plus(Abalone)
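If pairs.plus is not available, a similar display can be sketched with base R's pairs(), using panel.smooth for the off-diagonal smoothers and a small histogram panel on the diagonal (adapted from the examples in the pairs help page). This is not the pairs.plus implementation. The helper panel_hist is illustrative, and the built-in iris data stand in for the abalone data.

```r
# Diagonal panel: histogram rescaled to fit inside the panel.
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  nB <- length(h$breaks)
  heights <- h$counts / max(h$counts)
  rect(h$breaks[-nB], 0, h$breaks[-1], heights, col = "grey80")
}

# Stand-in data: the four numeric columns of the built-in iris data set.
demo_data <- iris[, 1:4]

# Scatterplot matrix with lowess smoothers off the diagonal and histograms on it.
pairs(demo_data, panel = panel.smooth, diag.panel = panel_hist)
```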
In this example we examine the relationship between cell perimeter and cell area as a function of cell radius. We will also conduct these investigations by conditioning on the tumor type, i.e. malignant (M) or benign (B). First we read in and inspect the dataset.
BreastDiag = read.csv("http://course1.winona.edu/bdeppa/Regression/Data/BreastDiag.txt",header=TRUE)
names(BreastDiag)
## [1] "Id" "Diagnosis" "Radius" "Texture" "Perimeter"
## [6] "Area" "Smoothness" "Compactness" "Concavity" "ConcavePts"
## [11] "Symmetry" "FracDim" "serad" "setex" "seperi"
## [16] "searea" "sesmoo" "secomp" "seconc" "seconpts"
## [21] "sesym" "sefd" "wrad" "wtex" "wperi"
## [26] "warea" "wsmoo" "wcomp" "wconc" "wconpts"
## [31] "wsym" "wfd"
head(BreastDiag)
## Id Diagnosis Radius Texture Perimeter Area Smoothness
## 1 842302 M 17.99 10.38 122.80 1001.0 0.11840
## 2 842517 M 20.57 17.77 132.90 1326.0 0.08474
## 3 84300903 M 19.69 21.25 130.00 1203.0 0.10960
## 4 84348301 M 11.42 20.38 77.58 386.1 0.14250
## 5 84358402 M 20.29 14.34 135.10 1297.0 0.10030
## 6 843786 M 12.45 15.70 82.57 477.1 0.12780
## Compactness Concavity ConcavePts Symmetry FracDim serad setex seperi
## 1 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.17000 0.1578 0.08089 0.2087 0.07613 0.3345 0.8902 2.217
## searea sesmoo secomp seconc seconpts sesym sefd wrad wtex
## 1 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33
## 2 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41
## 3 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53
## 4 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50
## 5 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67
## 6 27.19 0.007510 0.03345 0.03672 0.01137 0.02165 0.005082 15.47 23.75
## wperi warea wsmoo wcomp wconc wconpts wsym wfd
## 1 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
## 2 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
## 3 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
## 4 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
## 5 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678
## 6 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440
Next we investigate the relationships between cell radius and both cell area and cell perimeter. For each, we consider the theoretical relationship that holds if tumor cells are circular/spherical and compare it with the relationship obtained by scatterplot smoothing. We also examine these relationships conditional on the tumor type, i.e. benign or malignant.
The theoretical relationships are as follows:
Area: \(\small Area = \pi (Radius)^2\)
Perimeter: \(\small Perimeter = 2\pi (Radius)\)
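As a quick numerical check of these formulas, the sketch below applies them to the six rows shown in the head(BreastDiag) output above and compares the results with the recorded values. The data frame first6 and the added column names are illustrative; only the Radius, Area and Perimeter values come from the output.

```r
# Radius, Area and Perimeter for the first six cases, copied from head(BreastDiag).
first6 <- data.frame(
  Radius    = c(17.99, 20.57, 19.69, 11.42, 20.29, 12.45),
  Area      = c(1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1),
  Perimeter = c(122.80, 132.90, 130.00, 77.58, 135.10, 82.57)
)

# Theoretical values for a perfect circle of the given radius.
first6$AreaTheory  <- pi * first6$Radius^2
first6$PerimTheory <- 2 * pi * first6$Radius

# Ratios of observed to theoretical values (1 means exact agreement).
first6$AreaRatio  <- first6$Area / first6$AreaTheory
first6$PerimRatio <- first6$Perimeter / first6$PerimTheory

round(first6, 3)
```

In these six rows the recorded areas are close to the circular values, while every recorded perimeter exceeds 2π(Radius). Six rows are of course not enough to draw conclusions about the full data set.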
trendscatter(Area~Radius,data=BreastDiag)
plot(Area~Radius,data=BreastDiag,xlab="Cell Radius",ylab="Cell Area",main="Comparing LOWESS Smooth to Theoretical Model")
lines(sort(BreastDiag$Radius),pi*sort(BreastDiag$Radius^2),lty=2,col="red",lwd=3)
lines(lowess(BreastDiag$Radius,BreastDiag$Area,f=.2),lty=3,col="blue",lwd=3)
trendscatter(Area~Radius,data=BreastDiag[BreastDiag$Diagnosis=="B",],main="Benign Cells")
trendscatter(Area~Radius,data=BreastDiag[BreastDiag$Diagnosis=="M",],main="Malignant Cells")
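Conditioning on tumor type can also be shown in a single plot by fitting a separate smooth to each group. The sketch below does this with base R's split() and lowess() on simulated data. The data frame toy and its values are made up for illustration (with column names chosen to mirror BreastDiag) and say nothing about the real cells.

```r
set.seed(7)
n <- 120

# Simulated stand-in for BreastDiag: two groups with different radius ranges.
toy <- data.frame(
  Diagnosis = rep(c("B", "M"), each = n / 2),
  Radius    = c(runif(n / 2, 7, 15), runif(n / 2, 11, 25))
)
toy$Area <- pi * toy$Radius^2 * exp(rnorm(n, sd = 0.03))

# One lowess smooth per diagnosis group.
smooths <- lapply(split(toy, toy$Diagnosis),
                  function(d) lowess(d$Radius, d$Area, f = 0.5))

# Overlay both smooths on one scatterplot, colored by group.
plot(toy$Radius, toy$Area, xlab = "Cell Radius", ylab = "Cell Area",
     col = ifelse(toy$Diagnosis == "B", "blue", "red"))
lines(smooths$B, col = "blue", lwd = 2)
lines(smooths$M, col = "red", lwd = 2)
legend("topleft", legend = c("Benign", "Malignant"),
       col = c("blue", "red"), lwd = 2)
```

Plotting both groups on common axes makes it easier to see whether the two smooths follow the same curve or diverge.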
How does the relationship between cell area and cell radius differ between benign and malignant tumor cells?
trendscatter(Perimeter~Radius,data=BreastDiag)
plot(Perimeter~Radius,data=BreastDiag,xlab="Cell Radius",ylab="Cell Perimeter",main="Comparing LOWESS Smooth to Theoretical Model")
lines(sort(BreastDiag$Radius),2*pi*sort(BreastDiag$Radius),lty=2,col="red",lwd=3)
lines(lowess(BreastDiag$Radius,BreastDiag$Perimeter,f=.2),lty=3,col="blue",lwd=3)
trendscatter(Perimeter~Radius,data=BreastDiag[BreastDiag$Diagnosis=="B",],main="Benign Cells")
trendscatter(Perimeter~Radius,data=BreastDiag[BreastDiag$Diagnosis=="M",],main="Malignant Cells")
How does the relationship between cell perimeter and cell radius differ between benign and malignant tumor cells?